[SPARK-3414][SQL] Replace LowerCaseSchema with Resolver #2382

marmbrus · 2014-09-13T19:18:41Z

This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala. Specifically, while resolution remains case insensitive, it is now case preserving.

This PR is a follow up to #2293 (and to a lesser extent #2262 #2334).

In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones. While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a LowerCaseSchema operator), it had the consequence of breaking assumptions made by MultiInstanceRelation. Specifically, we can't swap out leaf operators in a tree without rewriting the changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).

In this PR, I instead remove the need to insert LowerCaseSchema operators at all, by moving the concern of matching up identifiers completely into analysis. Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).
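The difference between the two approaches can be illustrated with a minimal sketch in plain Scala. It does not use any Catalyst classes: `Column`, `lowerCaseSchema`, and `resolve` are hypothetical names chosen for illustration, and the real operators work on logical plans rather than simple sequences.

```scala
import java.util.Locale

object CasePreservingSketch {
  // Hypothetical stand-in for an attribute in a schema; not a Catalyst class.
  final case class Column(name: String)

  // Roughly the old idea: rewrite the schema to lower case up front,
  // which loses the original spelling of every column name.
  def lowerCaseSchema(schema: Seq[Column]): Seq[Column] =
    schema.map(c => c.copy(name = c.name.toLowerCase(Locale.ROOT)))

  // Roughly the new idea: leave the schema untouched and compare names
  // case-insensitively only when an identifier is being resolved.
  def resolve(schema: Seq[Column], requested: String): Option[Column] =
    schema.find(c => c.name.equalsIgnoreCase(requested))

  def main(args: Array[String]): Unit = {
    val schema = Seq(Column("userId"), Column("Name"))
    // The lowercased schema no longer contains the spelling "userId".
    println(lowerCaseSchema(schema).map(_.name))
    // Resolution matches "USERID" but returns the column as it was declared.
    println(resolve(schema, "USERID").map(_.name))
  }
}
```

In the second approach the result handed back to Python or Scala keeps the declared spelling, which is the "case preserving" behavior described at the top of this PR.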

While it is rolled back in this PR, storing the analyzed plan might actually be a good idea. For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore. This can be addressed in a follow up PR.

Follow-ups:

- Configurable case sensitivity
- Consider storing analyzed plans for temp tables

marmbrus · 2014-09-13T19:18:53Z

/cc @liancheng

marmbrus · 2014-09-13T19:19:38Z

/cc @ericl

SparkQA · 2014-09-13T19:22:55Z

QA tests have started for PR 2382 at commit c2f2ec8.

This patch merges cleanly.

SparkQA · 2014-09-13T19:23:53Z

QA tests have finished for PR 2382 at commit c2f2ec8.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-13T19:24:13Z

QA tests have started for PR 2382 at commit 5b93711.

This patch merges cleanly.

SparkQA · 2014-09-13T19:25:11Z

QA tests have finished for PR 2382 at commit 5b93711.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-13T19:33:37Z

QA tests have started for PR 2382 at commit 5b93711.

This patch merges cleanly.

SparkQA · 2014-09-13T19:34:13Z

QA tests have started for PR 2382 at commit 219805a.

This patch merges cleanly.

SparkQA · 2014-09-13T21:22:16Z

QA tests have finished for PR 2382 at commit 5b93711.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JavaSparkContext(val sc: SparkContext)
- class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception
- class Dummy(object):
- class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

SparkQA · 2014-09-13T21:24:30Z

QA tests have finished for PR 2382 at commit 219805a.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

chenghao-intel · 2014-09-15T08:49:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala

@@ -22,4 +22,9 @@ package org.apache.spark.sql.catalyst
 * Analysis consists of translating [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s
 * into fully typed objects using information in a schema [[Catalog]].
 */
-package object analysis
+package object analysis {
+  type Resolver = (String, String) => Boolean


Resolver is probably too general a name. Can we use a more precise name for this?

marmbrus · 2014-09-16

I think this will actually end up providing more general resolution functionality in the long term. I've added some Scaladoc for clarity, though.
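For readers unfamiliar with the alias, a function type of this shape can be satisfied by very small definitions. The following is only a sketch: the names `caseSensitive`, `caseInsensitive`, and `findColumn` are assumptions made for illustration and are not claimed to be what the patch defines.

```scala
object ResolverSketch {
  // Same shape as the alias added in the diff above.
  type Resolver = (String, String) => Boolean

  // Hypothetical resolvers: one exact, one that ignores case.
  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Looks up a requested name among stored names using whichever resolver
  // the caller supplies, and returns the stored spelling on a match.
  def findColumn(
      columns: Seq[String],
      requested: String,
      resolver: Resolver): Option[String] =
    columns.find(stored => resolver(stored, requested))

  def main(args: Array[String]): Unit = {
    val columns = Seq("Key", "Value")
    println(findColumn(columns, "key", caseSensitive))
    println(findColumn(columns, "key", caseInsensitive))
  }
}
```

Passing the comparison in as a value means the lookup code does not need to know which mode is active, which is one reason a general name may be a reasonable fit.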

liancheng · 2014-09-16T01:55:14Z

LGTM except some minor issues mentioned in the comments :)

liancheng · 2014-09-17T02:17:24Z

Oh, one more thing, please help rename this test case:

spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala

Line 525 in 0a7091e

    
           test("SPARK-3414 regression: should store analyzed logical plan when registering a temp table") {

SparkQA · 2014-09-18T22:59:32Z

QA tests have started for PR 2382 at commit 01cc29a.

This patch merges cleanly.

SparkQA · 2014-09-18T23:04:48Z

QA tests have started for PR 2382 at commit c21171e.

This patch merges cleanly.

SparkQA · 2014-09-18T23:53:59Z

QA tests have finished for PR 2382 at commit c21171e.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

SparkQA · 2014-09-18T23:58:34Z

QA tests have finished for PR 2382 at commit 01cc29a.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

cloud-fan · 2014-09-25T07:24:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+    (nestedFields, expression.dataType) match {
+      case (Nil, _) => expression
+      case (requestedField :: rest, StructType(fields)) =>
+        val actualField = fields.filter(f => resolver(f.name, requestedField))


There is a problem here. Currently a.b[1].c.d is parsed as GetField(GetField(GetItem(Unresolved("a.b"), 1), "c"), "d"), so the case-sensitivity check only happens when resolving Unresolved("a.b") to GetField(Attribute("a"), "b"). Something like "SELECT a[0].A.A from nested" will fail the case-sensitivity check in hql.
I think we should also do this check in GetField.

marmbrus · 2014-09-25

Hmm, good point. Right now you can't make a SQLContext case insensitive, but when you can, this will be a problem. Maybe you should note this on SPARK-3617.

marmbrus · 2014-09-25

Oh wait, sorry... Is that how the HiveQL parser will do it too? I'm not a huge fan of moving resolution logic into the expressions. What about a rule that only runs in case insensitive mode and fixes unresolved GetFields?

cloud-fan · 2014-09-26

Yes, this bug exists in HiveQL. I have opened a PR to fix this (adding a rule to fix unresolved GetFields): #2543. Need your comments :)
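To make the concern concrete, here is a minimal sketch of applying the resolver at every level of a nested field path, in the spirit of the diff fragment above. The types `DataType`, `StructField`, and `StructType` below are simplified stand-ins defined locally, not the Catalyst classes, and `resolvePath` is a hypothetical helper. Array indexing (GetItem) is deliberately not modeled.

```scala
object NestedFieldSketch {
  type Resolver = (String, String) => Boolean

  // Simplified, locally defined stand-ins for schema types.
  sealed trait DataType
  case object IntType extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Walks a field path through nested structs, using the resolver at each
  // level. On success it returns the path as spelled in the schema.
  def resolvePath(
      path: List[String],
      dataType: DataType,
      resolver: Resolver): Either[String, List[String]] =
    (path, dataType) match {
      case (Nil, _) => Right(Nil)
      case (requested :: rest, StructType(fields)) =>
        fields.filter(f => resolver(f.name, requested)) match {
          case Seq(field) =>
            resolvePath(rest, field.dataType, resolver).map(field.name :: _)
          case Seq() => Left(s"No such struct field $requested")
          case _ => Left(s"Ambiguous reference to field $requested")
        }
      case (requested :: _, _) =>
        Left(s"Cannot access field $requested on a non-struct type")
    }

  def main(args: Array[String]): Unit = {
    val inner = StructType(Seq(StructField("B", IntType)))
    val schema = StructType(Seq(StructField("a", inner)))
    val ignoreCase: Resolver = (x, y) => x.equalsIgnoreCase(y)
    println(resolvePath(List("A", "b"), schema, ignoreCase))
  }
}
```

If any level of the path bypasses the resolver and compares names exactly, a query that differs from the schema only in case fails at that level, which is the situation described in this thread.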

Replace LowerCaseSchema with Resolver.

5b93711

marmbrus force-pushed the lowercase branch from c2f2ec8 to 5b93711 Compare September 13, 2014 19:20

style

219805a

chenghao-intel reviewed Sep 15, 2014

Address comments.

2de881e

liancheng mentioned this pull request Sep 17, 2014

[SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ... #2397

Closed

Merge remote-tracking branch 'origin/master' into lowercase

d4320f1

Ensure the resolver is used for field lookups and ensure that case insensitive resolution is still case preserving.

c21171e

marmbrus force-pushed the lowercase branch from 01cc29a to c21171e Compare September 18, 2014 23:00

asfgit closed this in 293ce85 Sep 20, 2014

marmbrus deleted the lowercase branch September 22, 2014 19:53

cloud-fan reviewed Sep 25, 2014

cloud-fan mentioned this pull request Sep 26, 2014

[SPARK-3698][SQL] Correctly check case sensitivity in GetField #2543

Closed
